Improving protein identification from peptide mass 
fingerprinting through a parameterized multi-level 
scoring algorithm and an optimized peak detection 

We have developed a new algorithm to identify proteins by means of peptide mass
fingerprinting. Starting from the matrix-assisted laser desorption/ionization time-of-flight
(MALDI-TOF) spectra and environmental data such as species, isoelectric point and 
molecular weight, as well as chemical modifications or number of missed cleavages of 
a protein, the program performs a fully automated identification of the protein. The first 
step is a peak detection algorithm, which allows precise and fast determination of pep- 
tide masses, even if the peaks are of low intensity or they overlap. In the second step 
the masses and environmental data are used by the identification algorithm to search 
in protein sequence databases (SWISS-PROT and/or TrEMBL) for protein entries that 
match the input data. Consequently, a list of candidate proteins is selected from the 
database, and a score calculation provides a ranking according to the quality of the 
match. To define the most discriminating scoring calculation we analyzed the respec- 
tive role of each parameter in two directions. The first one is based on filtering and 
exploratory effects, while the second direction focuses on the levels where the parame- 
ters intervene in the identification process. Thus, according to our analysis, all input 
parameters contribute to the score, however with different weights. Since it is difficult 
to estimate the weights in advance, they have been computed with a genetic algorithm, 
using a training set of 91 protein spectra with their environmental data. We tested the 
resulting scoring calculation on a test set of ten proteins and compared the identifica- 
tion results with those of other peptide mass fingerprinting programs. 

Keywords: Mass spectrometry / Peak detection / Peptide mass fingerprinting / Protein identification
EL 3747 



1 Introduction 



One of the tasks of proteomics is to identify the proteins 
expressed by an organism or tissue [1]. This requires several
steps. The proteins are first isolated and some protein-specific
attributes are measured. A protein sequence 
database is then screened in order to retrieve the protein 
or proteins that best match these attributes. Until recently, 
the attributes were most commonly determined by chemi- 
cally extracting amino acid sequence information [2]. 
While these methods are reliable and can be fully-auto- 
mated, they are slow and do not allow high throughput 
identification. Hence new techniques for protein identifica- 
tion had to be developed. A major impetus came from 
mass spectrometry of large molecules. New methods 
such as MALDI [3, 4] and electrospray ionization (ESI) 
[5], as well as new spectrometers [6] became available 
and made it possible to analyze proteins in small concen- 
trations in a short time. Among the various spectrometric 
methods are: Fourier transform mass spectrometry 
(FTMS) [7], which provides a high mass resolution, and 
quadrupole time of flight (QTOF) [8], where ions of a small 
mass range are selected by a quadrupole ion trap and 
then transferred to a collision chamber before their frag- 
ments are analyzed. Furthermore, there are reflectron 
time-of-flight spectrometers (MALDI-TOF and ESI-TOF), 
which allow the measurement of masses in a large range 
with sufficient precision. 

Currently the most common method to identify proteins is, 
first, to enzymatically digest the proteins, then to deter- 
mine the masses of the resulting peptides by peak detec- 
tion on a MALDI-TOF or ESI-TOF spectrum, and finally to 
use the peptide mass fingerprints to search protein 
sequence databases for correct matches. Optimizing the 
peak detection and database search algorithms is thus 
the key to improving protein identification from peptide 
mass fingerprints. 



1.1 Peak detection 

Peak detection is an important step in the identification 
process. Occasionally only a few experimental peptide 
masses in the fingerprint match the theoretical masses in 
a database; thus failure to detect one peak can hinder the 
correct identification of a protein. On the other hand, if too 
many false peaks are considered, this may lead to errone- 
ous database matches. Furthermore, it is important to 
precisely determine the peptide masses. The peak detec- 
tion algorithm must also be able to correct calibration 
errors of the mass spectrometers. Finally, the process of 
peak detection should be fast and fully-automated in 
order to grant high throughput data handling. 

1.2 Identification 

The principle of protein identification using peptide mass 
fingerprinting is based on the comparison of the list of 
experimental masses with a database containing the the- 
oretical peptide masses of known proteins. The goal is to 
find the protein or proteins whose peptide masses provide 
the best match with the experimental fingerprint. It is 
worth mentioning that several other attributes of proteins 
may be useful in characterizing the likeness between the 
protein under investigation and the candidate proteins 
from the database [9]. Information about the species, the 
molecular mass or the isoelectric point of the whole pro- 
tein can be very helpful in selecting the right protein. 
Chemical modifications caused by biochemical mecha- 
nisms in the living cell or during the preparation of the 
experiment modify the peptide masses and also have to 
be taken into account while parsing the database. 

Several programs exist that perform this kind of protein 
identification. They all use some of the available attributes 
and search various protein sequence databases. The crit- 
ical question in this approach is to present the user with a 
ranking of the proteins that match the protein under inves- 
tigation, which considerably facilitates the interpretation 
of the identification results. Most programs show scores 
associated with each protein, thus giving a degree of con- 
fidence in the matching protein. The simplest scoring 
method is to count the number of matching peptide 
masses. This is applied by the PeptideSearch program 
(http://www.mann.embl-heidelberg.de/Services/PeptideSearch/FR_PeptideSearchForm.html), which searches 
the nrdb database, as well as by the PeptIdent program 
[10] (http://www.expasy.ch/tools/peptident.html), which 
searches the SWISS-PROT and TrEMBL databases [11]. 
In addition, PeptIdent uses some of the annotations from 
SWISS-PROT to refine its search, taking into account 
known protein modifications (post-translational and 
processing of precursor molecules into mature chains 
and peptides). The Mowse program
(http://srs.hgmp.mrc.ac.uk/cgi-bin/mowse) [12] determines a score by considering
the frequency of each peptide mass in its database 
(OWL) in order to emphasize the rarest peptides. This 
score also takes into account the presence of missed 
cleavages in matched peptides, modifying their weight in 
the score by a fixed factor (pFactor). The MS-Fit program 
(http://prospector.ucsf.edu/ucsfhtml3.2/msfit.htm) uses 
the same scoring method on the (nonredundant) NCBInr 
database. ProFound [13] (http://prowl.rockefeller.edu/cgi-bin/ProFound)
calculates a probability for the identification
of the right protein, given by a Bayesian formula, and 
uses the distance between experimental and theoretical 
masses obtained from the NCBInr database. Finally, the 
MassProfile program, included in the Darwin library
(http://cbrg.inf.ethz.ch/) [14], also determines an identification 
score based on the probability of randomly obtaining a 
match of n experimental masses with n theoretical 
masses, given the interval of possible masses and the 
maximum allowed distance of masses accepted in this 
match. 
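
The simplest scoring method described above, counting matching peptide masses, can be sketched as follows. This is a minimal illustration and not the code of any of the cited programs; the function name `count_matches` and the default tolerance of 0.2 Da are assumptions made for the example.

```python
from bisect import bisect_left


def count_matches(experimental, theoretical, tolerance=0.2):
    """Count experimental masses within `tolerance` Da of any theoretical mass.

    Hypothetical helper: the name and the default tolerance are
    illustrative assumptions, not taken from the programs cited above.
    """
    theo = sorted(theoretical)
    matches = 0
    for mass in experimental:
        i = bisect_left(theo, mass)
        # Only the neighbours on either side of the insertion point
        # can be the closest theoretical masses.
        neighbours = theo[max(i - 1, 0):i + 1]
        if any(abs(mass - t) <= tolerance for t in neighbours):
            matches += 1
    return matches
```

Ranking candidate proteins by this count treats every matched mass as equally informative, which is the limitation that frequency-based and probabilistic scores try to address.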

All these algorithms utilize the various attributes pre- 
sented above to control the number of proteins consid- 
ered for the identification. However, they make little use 
of this information in their scoring calculation, since they 
use at most one or two of the attributes (distance between 
masses, presence of missed cleavages, mass distribution 
in the database, etc). These represent only a small part of 
the parameters that could influence the quality of identifi- 
cation. In order to better understand their respective role, 
we carried out a systematic study of the importance of 
each attribute in the identification process. This led to the 
definition of a new scoring scheme that takes into account 
maximal information from each attribute, thus allowing for 
a better discrimination of candidate proteins and facilitat- 
ing the identification of the right protein. This paper first 
presents an optimized automated peak detection algo- 
rithm and then details a new protein identification method, 
as well as its associated scoring procedure. 

2 Materials and methods 

2.1 Materials 

2.1.1 Chemicals 

SDS-PAGE molecular weight standards were purchased 
from Bio-Rad Laboratories (Hercules, CA, USA). Sequen- 
cing-grade modified trypsin was purchased from Promega 
(Madison, WI, USA). Trifluoroacetic acid (TFA) and
α-cyano-4-hydroxy-trans-cinnamic acid (ACCA) were purchased
from Sigma (St. Louis, MO, USA). Acetonitrile 
(AcCN), HPLC-grade, was purchased from Fluka (Buchs, 
Switzerland). Methanol (analytical grade) and sodium 
bicarbonate were purchased from Merck (Darmstadt, 
Germany). Immobilized pH gradient strips were pur- 
chased from Amersham-Pharmacia-Biotech (Uppsala, 
Sweden). 
2.1.2 Sample preparation 

Bio-Rad molecular weight standards were separated by 
1-D SDS electrophoresis [15]. The other protein samples 
were separated by two-dimensional gel electrophoresis 
(2-DE) [16]. Gels were stained with Coomassie Brilliant 
Blue (CBB) R-250 (0.1% w/v), methanol (30% v/v) and 
acetic acid (10% v/v) for 30 min and were destained with 
repeated washes of methanol (40% v/v) and acetic acid 
(10% v/v) solution. Protein spots were excised and 
destained with 100 µL of a 50 mM ammonium bicarbonate 
solution at 37°C for 45 min. Destaining solution was 
removed and the gel pieces were dried under vacuum. 
The gel pieces were generally reswollen with 20 µL of 
20 mM ammonium bicarbonate and 4 µL of 0.1 mg/mL of 
trypsin. The gel was dried to evaporate solvent and volatile
salts, usually after overnight incubation at room temperature.
Then, 20 µL of 50% AcCN, 0.1% TFA were 
added for 10 min with sonication to extract peptides from 
the gel. 

2.1.3 Mass spectrometry 

Mass spectrometry measurements were performed on a 
MALDI-TOF mass spectrometer Voyager™ Elite (PerSeptive
Biosystems, Framingham, MA, USA) equipped 
with a 337 nm nitrogen laser. The analyzer was used in the 
reflectron mode at an accelerating voltage of 18-20 kV 
and a delayed extraction set to 100-140 ns. Laser power 
was generally set about 20% above threshold for matrix 
molecular ion production. Spectra were accumulated between
10 and 256 times. The matrix solution used was 4 mg/mL
ACCA in 30-50% AcCN, 0.1% TFA. 

2.1.4 Computer hardware 

The programs are written in ANSI C++ and run on Unix 
and Windows systems. We also developed a Perl script 
that allows running the peak detection on a Windows PC 
and the database search on a Unix server. 

2.2 Peak detection in MALDI-TOF mass spectra 
2.2.1 Introduction 

A MALDI-TOF spectrum is a sampled signal, i.e., an array 
of floating point values that consists of trend, noise and 
peaks (Fig. 1). The trend or baseline is the signal pro- 
duced by the electronics of the mass spectrometer that 
one would obtain if no material entered the mass spec- 
trometer and in the absence of noise. It does not vary 
over small mass ranges (~ 10 Da). The noise, which is 
caused by electronic disturbances and fragments of 
material, varies over small mass ranges (< 1 Da) with little
correlation, i.e., each array value varies randomly and 
almost independently of its neighboring values. Peaks 
have a more or less predefined shape (see below) and 
are therefore strongly correlated. The notion of a peak 
may be misleading, because one "peak" actually consists 
of several so-called isotopic peaks [17] (Figs. 2 and 5). 
Determining the monoisotopic mass of peaks is a long-standing
task [18, 19], and software accomplishing this 
task is usually delivered together with the spectrometer 
hardware. Unfortunately, the software we had at hand 
lacked the necessary flexibility and accuracy. Therefore a 
custom peak detection program had to be conceived. It 
has been designed to yield a precise and fast localization 
of all the peaks, even if these are small or if several peaks 
overlap. 

Figure 1. Mass spectrum of the Escherichia coli protein
prolyl-tRNA synthetase (SYP_ECOLI, P16659) digested with
trypsin. The spectrum was acquired with a Voyager Elite
MALDI-TOF mass spectrometer. Note that the mass unit is
M/Z, where M is the mass in Da and Z the number of unit
charges of a peptide. (a) Part of the spectrum containing
peaks; (b) a noisy region. The middle line shows the trend,
while the upper and lower solid lines show the trend plus
and minus the noise, respectively.

In order to detect peaks in a spectrum, we can apply a 
regression algorithm [20]. Let $t_\alpha$ be a template or model 
for a peak, where $\alpha$ is a set of parameters (such as height 
and width). A peak is detected in a spectrum if its template
fits a part of that spectrum, i.e., if a measure of distance
between the spectrum and the template, $\|s - t_\alpha\|$, has a 
local minimum and is smaller than a threshold value. The 
choice of $t_\alpha$ is crucial. Three conditions must be fulfilled: 

(i) the match should be clear (high signal-to-noise ratio), 

(ii) it should be precise (low deviation due to noise), and 

(iii) it should be unique (local minima not too close to each 
other). The first two conditions can be solved analytically. 
They yield $t_\alpha = p_\alpha$, where $p_\alpha$ denotes an ideal peak as it 
would appear in a noiseless spectrum. The third condition 
results in blurring the template, i.e., reducing its high 
frequency part and thereby smoothing the error function. 
Hence, it competes with the other two conditions. Canny 
[21] developed this theory for the case where only one 
pattern with a fixed shape is present. But as we will see 
below, we have to deal with peaks of variable shape, and 
the template has to adapt to these shapes. 

2.2.2 Application to MALDI-TOF mass spectra 

Since multiply charged peptides are rarely observed in 
MALDI-TOF spectra, we can assume that all peptides 
carry a single charge, thus the spacing between isotopic 
peaks is 1 Da. An isotopic distribution defines the proba- 
bilities that a molecule carries additional neutrons. For 
peptides of the same mass, there are several possible 
isotopic distributions. They depend on the atomic compo- 
sition and particularly on the number of sulfur atoms [22]. 
However, the differences are scarcely visible in a mass 
spectrum, because the atomic compositions of peptides 
with the same mass are similar. Thus, we only consider 
an average isotopic distribution calculated by averaging 
all peptides with a mass in [m, m + 1] that are obtained 
from in silico digestion of the proteins in the SWISS-PROT
database [22]. Let $p_m^{\mathrm{iso}}(i)$, $i = 0, 1, 2, \ldots$, denote the 
probability of having $i$ additional neutrons in this average 
distribution. The probability of the monoisotopic peak 
$p_m^{\mathrm{iso}}(0)$ decreases with increasing mass $m$ (Fig. 2). This 
feature is crucial for a correct peak detection in different 
peptide mass ranges. 

Figure 2. Average isotopic distributions $t_{m,h_0,h,\sigma}$.
The monoisotopic part is marked with the label 'mono'. (a) 
For m = 1000 Da; (b) m = 2000 Da; and (c) m = 3000 Da. 
$\sigma = 0.28$, $h_0 = 0$ and $h = 1$.

A template can now be obtained from $p_m^{\mathrm{iso}}$:

$$t_{m,h_0,h,\sigma}(m') = h_0 + h \sum_i p_m^{\mathrm{iso}}(i)\, \exp\!\left(-\frac{(m' - m - i)^2}{2\sigma^2}\right) \qquad (1)$$

where $h_0$ is the offset, $h$ the height, $m$ the monoisotopic 
mass and $\sigma$ the width. The offset $h_0$ is necessary to correct
errors in the trend estimation (see Section 2.2.3). 
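
As a rough illustration of such a template, the sketch below sums one bell-shaped contribution per isotopic peak, spaced 1 Da apart. It assumes a Gaussian shape for each isotopic peak and takes the isotopic probabilities as a plain list; the function name `template` and the example probabilities are hypothetical, not values from the SWISS-PROT averaging described above.

```python
import math


def template(m_prime, m, h0, h, sigma, iso_probs):
    """Value of an isotopic peak template at mass m_prime.

    Assumptions for this sketch: each isotopic peak is Gaussian with
    width sigma, and iso_probs[i] is the probability of i additional
    neutrons (supplied by the caller).
    """
    total = 0.0
    for i, p in enumerate(iso_probs):
        # Isotopic peaks of singly charged peptides are spaced 1 Da apart.
        d = m_prime - (m + i)
        total += p * math.exp(-d * d / (2.0 * sigma * sigma))
    return h0 + h * total


# Hypothetical isotopic distribution, for illustration only.
example_probs = [0.5, 0.3, 0.2]
value_at_mono = template(1000.0, 1000.0, 0.0, 1.0, 0.2, example_probs)
```

With a narrow width such as 0.2 Da the neighbouring isotopic peaks barely overlap, so the template value at the monoisotopic mass is close to `h0 + h * iso_probs[0]`.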

Let us now define the error function $e$:

$$e(m) = \min_{h_0,h,\sigma} \frac{1}{h^2 (m_1 + m_2)} \sum_{m_i = m - m_1}^{m + m_2} \left( s(m_i) - t_{m,h_0,h,\sigma}(m_i) \right)^2 \qquad (2)$$

where $m_i$ are the sampling values of the spectrum and 
$m_1 = 1$ and $m_2 = 5$ define the window in which Eq. (2) is 
evaluated. It must be large enough to contain the bulk of 
$t_{m,h_0,h,\sigma}(m')$. However, if it is too large, the $\sum s^2(m_i)$ term 
would dominate, and the contribution of $t_{m,h_0,h,\sigma}(m')$ to 
$e(m)$ would become marginal. The division of the sum
by $h^2 (m_1 + m_2)$ normalizes $e$ with respect to the 
height and the size of the template. We normalize with $h^2$ 
because the shape of the template is an average of all 
possible shapes, and thus does not need to match a 
peak exactly, even in the absence of noise. This deviation grows 
linearly with the height, and we reduce its effect by the 
normalization. The task is now to find conditions for the 
error function $e$ that characterize the peaks. The first 
condition is $\mathrm{d}e/\mathrm{d}m = 0$, i.e., $e$ has a local minimum with 
respect to $m$. Usually, several local minima $m_j$ of Eq. (2) 
are found in a neighborhood (Fig. 3), and we accept $m_p$ 
as a possible peak if $e(m_p) < e(m_j)\ \forall m_j \in [m_p - (m_1 + m_2)/2,\ m_p + (m_1 + m_2)/2]$.

Figure 3. Error function (2) in the vicinity of a peak at 
1572.835 Da. The lowest minimum corresponds to the 
monoisotopic peak. Note the logarithmic scale of e. 
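
The evaluation of the error function at one candidate mass can be sketched as below. This is a sketch under stated assumptions rather than the authors' implementation: the isotopic peak shape is taken to be Gaussian, the width `sigma` is held fixed, the offset and height are obtained by ordinary linear least squares, and the names `shape`, `fit_error` and `iso_probs` are hypothetical.

```python
import math


def shape(x, m, sigma, iso_probs):
    """Unit-height isotopic pattern anchored at monoisotopic mass m (Gaussian assumed)."""
    return sum(p * math.exp(-((x - m - i) ** 2) / (2.0 * sigma ** 2))
               for i, p in enumerate(iso_probs))


def fit_error(masses, signal, m, sigma, iso_probs, m1=1.0, m2=5.0):
    """Return (e, h0, h) for a template anchored at mass m.

    The offset h0 and height h follow from linear least squares on the
    window [m - m1, m + m2]; e is the squared residual normalised by
    h^2 * (m1 + m2).
    """
    pts = [(x, s) for x, s in zip(masses, signal) if m - m1 <= x <= m + m2]
    n = len(pts)
    if n < 3:
        return math.inf, 0.0, 0.0
    g = [shape(x, m, sigma, iso_probs) for x, _ in pts]
    s = [v for _, v in pts]
    g_mean = sum(g) / n
    s_mean = sum(s) / n
    var_g = sum((gi - g_mean) ** 2 for gi in g)
    if var_g == 0.0:
        return math.inf, 0.0, 0.0
    h = sum((gi - g_mean) * (si - s_mean) for gi, si in zip(g, s)) / var_g
    h0 = s_mean - h * g_mean
    if h <= 0.0:
        # A non-positive height cannot describe a peak.
        return math.inf, h0, h
    ssr = sum((si - (h0 + h * gi)) ** 2 for gi, si in zip(g, s))
    return ssr / (h * h * (m1 + m2)), h0, h
```

Scanning `fit_error` over candidate masses and keeping the local minima gives the possible peaks; a full implementation would also vary the width.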

From these we select as real peaks those which satisfy 
the following conditions: 

$$e(m_p) < e_{\max}, \qquad \frac{h}{n(m_p)} > h_{\min} \qquad (3)$$

where $e_{\max}$ and $h_{\min}$ are thresholds and $n$ is the estimation
of the noise around the peak. Since the quality and 
height of the peaks vary, perfect values for $e_{\max}$ and $h_{\min}$ 
that are able to distinguish all 'true' peaks from noise do 
not exist. If the values are restrictive, i.e., if $e_{\max}$ is small 
and $h_{\min}$ is large, no 'false' peaks are detected, but we 
also lose some 'true' ones. Conversely, by increasing 
$e_{\max}$ and decreasing $h_{\min}$, more and more 'false' peaks 
appear (Fig. 6). 

The values of $e_{\max}$ and $h_{\min}$ are linked to the values of the 
parameters one has to choose for the database search. 
For example, if the search values are restrictive, i.e., the 
mass tolerance is low and the minimal number of peptide 
masses that must match is high, the values of $e_{\max}$ and 
$h_{\min}$ must be less restrictive in order for all 'true' peaks to 
be taken into consideration. The 'false' peaks do not 
change the result if they are not too abundant, because 
the probability that several of them match the same protein,
thus giving rise to a high score for this irrelevant protein,
is low. On the other hand, if the database search 
parameters are less restrictive, $e_{\max}$ and $h_{\min}$ may not 
allow many 'false' peaks, because the probability of a 
false match is now higher and the result may change 
qualitatively. 

2.2.3 Implementation 

The first step is to remove the trend because the height of 
the peaks is only defined relative to this trend, and 
because the numerical accuracy is improved. Ripley [23] 
describes several methods. The crucial point is the 
robustness of a method, i.e., it should not follow 
localized deviations, like peaks. Here we choose a simple 
approach to attain a robust fit. The spectrum is split into 
several small windows $\omega_i$ (width about 40 Da), and $s_i^{\mathrm{med}}$ 
and $s_i^{\mathrm{low}}$ are calculated in each window, where $s_i^{\mathrm{med}}$ and 
$s_i^{\mathrm{low}}$ are defined as follows: 50% of the values in $\omega_i$ are 
larger than $s_i^{\mathrm{med}}$ and 50% are lower, while 95% of the values
in $\omega_i$ are larger than $s_i^{\mathrm{low}}$ and 5% are lower. Then we 
define the noise as $n_i = 2 (s_i^{\mathrm{med}} - s_i^{\mathrm{low}})$. Finally, $s_i^{\mathrm{med}}$ and $n_i$ 
are interpolated using cubic splines [24] to obtain the continuous
trend $s^{\mathrm{med}}(m)$ and noise $n(m)$ (Fig. 1). 
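
The window-based trend and noise estimation of Section 2.2.3 can be sketched as follows. The per-window median and 5th percentile follow the description in the text; the nearest-rank percentile rule, the function names, and the use of piecewise-linear interpolation in place of cubic splines are simplifying assumptions of this sketch.

```python
def percentile(sorted_vals, q):
    """Nearest-rank value below which roughly a fraction q of the values lie."""
    idx = min(int(q * len(sorted_vals)), len(sorted_vals) - 1)
    return sorted_vals[idx]


def window_trend_noise(masses, signal, width=40.0):
    """Return (window centre, trend, noise) for consecutive windows of `width` Da."""
    out = []
    start = masses[0]
    while start <= masses[-1]:
        vals = sorted(s for x, s in zip(masses, signal)
                      if start <= x < start + width)
        if vals:
            s_med = percentile(vals, 0.50)  # robust trend estimate
            s_low = percentile(vals, 0.05)  # 95% of the values lie above this
            out.append((start + width / 2.0, s_med, 2.0 * (s_med - s_low)))
        start += width
    return out


def interpolate(centres, values, x):
    """Piecewise-linear interpolation (stand-in for the cubic splines in the text)."""
    if x <= centres[0]:
        return values[0]
    if x >= centres[-1]:
        return values[-1]
    for k in range(len(centres) - 1):
        x0, x1 = centres[k], centres[k + 1]
        if x0 <= x <= x1:
            y0, y1 = values[k], values[k + 1]
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
```

Because the median ignores a few very large values, an isolated peak inside a window leaves the estimated trend essentially unchanged, which is the robustness property asked for above.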



If $m$ and $\sigma$ are given, calculating $h_0$ and $h$ is straightforward,
thus minimizing Eq. (2). Hence we have to seek the 
minima of $e$ only in $m$ and $\sigma$. Since this algorithm is also 
used for high throughput processes such as the molecular 
scanner [25], execution time is crucial. A direct evaluation 
of Eq. (2) was too slow, and it is therefore necessary to 
perform a fast first search to find starting points for a more 
extended search. This first search is done by fixing $\sigma = 0.2$
(this value neither produces too many minima, nor 
does it blur $e$ too much; see Section 2.2.1) and evaluating 
Eq. (2) for masses where the signal exceeds the noise. 
Because the template $t$ varies slowly with the mass, it 
does not have to be evaluated for each mass. We then 
calculate the minima of Eq. (2) and use the resulting 
masses as starting points for a more precise fit, where 
both $m$ and $\sigma$ vary. It is possible that two peptides have 
similar masses, so that their peaks overlap. In this case, 
the method described above may fail to detect both 
peaks. To solve this problem, all detected peaks are sub- 
tracted from the spectrum, and the algorithm is applied a 
second time. 

2.3 Calibration 

The TOF measured by a MALDI-TOF mass spectrometer 
can be affected by a significant error. After converting 
the TOF into the peptide mass [6], this can yield an error 
of up to 1 Da. However, most of that error can be cor- 
rected afterwards by a linear transformation: 



$$m_{\mathrm{calib}} = a\,m + b \qquad (4)$$



The coefficients $a$ and $b$ can be determined in three different
ways. (i) They are defined externally. (ii) They are calculated
using internal standards, i.e., peptides with known 
masses that appear in the spectrum. This method works 
well if internal standards are present and if they are 
detectable in the spectrum. Then it allows reducing the 
error to values smaller than 0.05 Da. (iii) They are calculated
with a maximum likelihood method. This method is 
based on the fact that the mass distribution of peptides is 
not at all uniform. First, the distribution peaks at certain 
masses separated by ~1 Da; and second, it drops with 
higher masses (Fig. 4) (for a detailed discussion see 
[22]). Let $P(m)\,\Delta m$ be the probability of finding a mass in 
$[m, m + \Delta m]$. For a set of peaks with masses $m_i$, $a$ and $b$ 
are chosen to maximize the total probability $\prod_i P(a m_i + b)\,\Delta m$.
This method is independent of internal standards, but 
it only works for initial errors that are smaller than 
0.5 Da, making the error less than 0.2 Da in most cases. 
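
For the internal-standard variant (ii), the coefficients of the linear correction can be obtained from pairs of measured and known masses. The sketch below uses an ordinary least-squares fit, which is an assumption of this example (the text does not state how the coefficients are computed); with exactly two standards it reduces to solving the two linear equations. The function names are hypothetical.

```python
def fit_calibration(measured, known):
    """Least-squares a, b such that known is approximately a * measured + b.

    Requires at least two standards with distinct measured masses.
    """
    n = len(measured)
    mx = sum(measured) / n
    my = sum(known) / n
    sxx = sum((x - mx) ** 2 for x in measured)
    a = sum((x - mx) * (y - my) for x, y in zip(measured, known)) / sxx
    b = my - a * mx
    return a, b


def calibrate(masses, a, b):
    """Apply the linear correction to every detected mass."""
    return [a * m + b for m in masses]
```

After fitting on the standards, `calibrate` is applied to all detected peak masses of the same spectrum before the database search.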

2.4 Identification by peptide mass fingerprinting 

2.4.1 Problems 

Identification by peptide mass fingerprinting uses a set of 
experimental peptide masses obtained from the mass 
spectrum after peak detection, as well as information 
about the species, the isoelectric point or the molecular 
mass of the searched protein. These experimental 
masses are compared to a database of peptide masses, 
i.e., a database of in silico digested proteins. The identification
algorithm searches for the protein with the best 
match between its theoretical peptide masses and the 
experimental masses. Other attributes of the searched 
protein are also taken into consideration and matched to 
their corresponding values in the database. This method 
involves various problems that influence the quality of the 
identification. 

First of all, we need to know which database to use, and 
how it has to be parsed. There are two approaches when 
using a database: either we search a database of protein 
sequences which is parsed linearly, each sequence being 
virtually digested to progressively determine the peptide 
masses, or we build an index of all possible peptide 
masses (sorted in ascending order) by an off-line diges- 
tion of a protein sequence database. In both cases we 
consider possible modifications and missed cleavages. 
The first method has the advantage of using less disk 
space, because everything is calculated on-line, and thus 
does not need to be stored. It also easily allows (by 
changing digestion rules) considering different enzymes 
for the digestion. Nevertheless, it could require a longer 
parsing time due to the fact that digestion operations, and 
especially all the combinations of modifications that could 
occur on peptides, have to be computed on-line. The sec- 
ond method has the advantage of retaining all possible 
peptide masses and therefore avoiding the combinatorial 
treatment of modifications during the search. Its draw- 
backs are the considerable additional space needed to 
store the index, as well as the time necessary to update it. 
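
Both database approaches rest on a virtual digestion step. The sketch below illustrates it for trypsin with missed cleavages and builds a small sorted mass index. It is an illustration under assumptions: the cleavage rule (after K or R, but not before P) is the common textbook convention rather than one stated in the text, modifications are ignored, masses are neutral monoisotopic peptide masses, and all function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Standard monoisotopic residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_fragments(seq):
    """Split after K or R unless the next residue is P (assumed rule)."""
    frags, start = [], 0
    for i, aa in enumerate(seq):
        nxt = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and nxt != "P":
            frags.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        frags.append(seq[start:])
    return frags


def digest(seq, max_missed=1):
    """All peptides with at most max_missed missed cleavage sites."""
    frags = tryptic_fragments(seq)
    peptides = []
    for i in range(len(frags)):
        for k in range(max_missed + 1):
            if i + k < len(frags):
                peptides.append("".join(frags[i:i + k + 1]))
    return peptides


def peptide_mass(pep):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[a] for a in pep) + WATER


def build_index(proteins, max_missed=1):
    """Off-line index: (mass, protein name, peptide) sorted by mass."""
    return sorted((peptide_mass(p), name, p)
                  for name, seq in proteins.items()
                  for p in digest(seq, max_missed))


def lookup(index, mass, tol):
    """Index entries whose mass lies within tol Da of the query mass."""
    keys = [e[0] for e in index]
    return index[bisect_left(keys, mass - tol):bisect_right(keys, mass + tol)]
```

The on-line approach corresponds to calling `digest` for each sequence during the search, while the indexed approach pays the digestion cost once in `build_index` and answers each query with a binary search.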

A second problem arises from the large number of param- 
eters that influence the identification process. Indeed, as 
we have already seen, modifications and missed clea- 
vages can occur and modify a protein's peptide masses. 
If we allow for each theoretical peptide to carry zero, one 
or several modifications and for the enzyme to miss 0, 1 
or 2 cleavage sites, this strongly increases the number of 
theoretical peptide masses the program must consider. 
Also, identification algorithms use various thresholds that 
can appreciably modify the search results. Examples are 
the molecular weight range, the pI range, the minimum 
number of matches, the allowed difference between 
experimental and theoretical masses (mass tolerance), 
etc. For these reasons, the user should usually have an 
a priori idea of the experimental context, because an optimal
choice of the parameter values will facilitate the interpretation
of the results. 

Figure 4. (a) Distribution of theoretical peptide 
masses obtained by a virtual digestion of all 
proteins in SWISS-PROT with trypsin (without 
missed cleavages or modifications). Note how 
the distribution drops for higher masses. 
(b) Detail of (a). The distribution is peaked at 
given masses. 

Electrophoresis 1999, 20, 3535-3550

The third problem results from the way the resulting pro- 
teins are ranked by the identification algorithm. Depend- 
ing on the parameters that have been selected for the 
search, the number of proteins in the database that match 
the experimental data can be very large. The program 
must therefore associate a score with each candidate protein,
and thus allow the confidence in its match to be 
quantified. 

2.4.2 Parameters 

In order to choose an efficient method to handle the 
above-mentioned problems, the use of the parameters 
has been formalized. This has implications on the choice 
of the database structure and on the calculation of the 
score associated with each candidate protein. Parame- 
ters can be characterized in two ways. The first possibility 
concerns their effects on the quality and efficiency of the 
identification. The second possibility is linked to the level 
in the identification process at which a parameter inter- 
venes. 

When considering the first possibility one considers the 
fact that the parameters have two opposite effects during 
the search: an "exploratory" effect and a "filtering" effect. 
The exploratory effect allows an increase in the size of 
the search space, that is, an increase in the number of 
candidate proteins. Indeed, the first difficulty of the identi- 
fication is to be sure to include the correct protein in the 
list of candidate proteins. Therefore the tolerance in the 
set of considered proteins and masses must be high 
enough to find the right protein. Parameters that are 
involved in this class of effects are: the type and number 
of modifications applied to proteins in the database, the 
maximum number of missed cleavages, the maximum 
distance between experimental and theoretical masses, 
the minimum number of matched peptides necessary for 
a protein to be selected, and the number of peaks 
returned by the peak detection program. 

The second difficulty in the identification is to minimize 
the number of candidate proteins, in order to avoid losing 



important to efficiently filter the results and eliminate the 
least likely proteins from the list of candidates. Parame- 
ters with such a filtering effect include the species (to 
reduce the number of proteins to be considered), the mo- 
lecular mass and the isoelectric point (to eliminate pro- 
teins whose values are too far from the experimental 
ones). Moreover, some of the parameters mentioned 
above for their exploratory effects, like the maximum dis- 
tance of masses, the minimum number of matched pep- 
tides or the number of detected peaks, also have some fil- 
tering effects - depending on their thresholds. 
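
The filtering effect of species, molecular mass and isoelectric point can be sketched as a simple pre-selection of candidates. The record layout, the function names and the relative tolerances of 20% (mass) and 10% (pI) are assumptions chosen for illustration, not values from the text.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    species: str
    mw: float  # molecular mass in Da
    pi: float  # isoelectric point


def within(value, expected, rel_tol):
    """True if value lies within +/- rel_tol (a fraction) of expected."""
    return abs(value - expected) <= rel_tol * expected


def filter_candidates(cands, species=None, mw=None, pi=None,
                      mw_tol=0.2, pi_tol=0.1):
    """Keep candidates compatible with the experimental species, mass and pI."""
    kept = []
    for c in cands:
        if species is not None and c.species != species:
            continue
        if mw is not None and not within(c.mw, mw, mw_tol):
            continue
        if pi is not None and not within(c.pi, pi, pi_tol):
            continue
        kept.append(c)
    return kept
```

Applying such a filter before the peptide mass comparison shortens the search, but too narrow a tolerance removes the correct protein from the candidate list, which is the compromise between filtering and exploratory effects.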

The main difficulty consists in finding a compromise be- 
tween these two aspects. On one hand, one wants to be 
sure to consider enough candidate proteins, therefore the 
exploratory effect has to be increased. On the other hand, 
one seeks to clearly identify the right protein and therefore 
has to filter the results. Depending on their exploratory or 
filtering nature, parameters may have a notable effect on 
the processing time needed for the identification. The 
more exploratory effects are used, the longer the search 
time will be. The sooner the filtering effects that are 
applied, the shorter it will be. The quality and efficiency of 
the identification will thus be highly dependent on the 
choices of the parameter values. 

The second method of characterizing parameters is 
based on the levels at which parameters participate in the 
identification process. In the case of two-dimensional 
electrophoresis (2-DE), three levels can be considered. 
The first one, the "mass level", corresponds to the choice 
of mass used to match a protein. At this level, we want to 
characterize the degree of match between a mass found 
in the spectrum and the mass of a peptide of the search 
protein. The next level, the "protein level", consists of the 
identification of a protein at a given position in the 2-DE 
gel. Information from the mass level is coupled with infor- 
mation about the whole protein, in order to determine the 
best candidate protein. Finally, at the "contextual level", 
information about the two-dimensional environment (context)
of the selected proteins from level 2 is taken into 
account to refine the identification at each position in the 
gel. 

At the mass level, the first goal is to determine the quality 
of a peak, that is, to determine when a peak may be con- 
sidered to be a "true" peak. For that purpose, parameters 
such as peak intensity, peak width or the peak's fit with a 
theoretical isotopic profile (see Section 2.2) can be used. 
A level of confidence is also defined for the match of an 
experimental mass with a theoretical mass in the protein 
database. This is achieved with the help of parameters 
such as the number and type of modifications, the number
of missed cleavages, or the hydrophobicity value (GRAVY) [26] of the corresponding 
peptide. The latter estimates the probability of finding the 
peptide in the mass spectrum (the hydrophobicity value is 
important for the ability of a peptide to fly in the mass 
spectrometer). 

At the second level, the protein level, we search in the set 
of candidate proteins for the protein showing the best cor- 
respondence with all information available from the gel 
and the spectrum. Values obtained at the mass level, as 
well as parameters describing the whole protein, can be 
used. Such parameters are the molecular mass and iso- 
electric point, but also the percentage of the protein 
sequence that is covered by peptides identified at level 1, 
or the standard deviation of the distance between theoret- 
ical and experimental peptide masses. 

The contextual level allows an adjustment of the identifi- 
cations obtained from the previous levels by taking the 
environment into account. For each position in the 2-DE 
gel where identification is attempted, the points in the 
neighborhood are considered. The distribution of the 
masses used for this identification, the distribution of the 
identified proteins, as well as of the parameters used in 
the previous steps are considered [25]. This method vali- 
dates or invalidates certain parameters, thus altering the 
results of the previous levels. In this way, one can imag- 
ine an iterative method that gradually refines the identifi- 
cation by successive application of the three levels. 

2.4.3 The algorithm 
2.4.3.1 Initial choices 

As we have seen, the choice of parameters as well as the 
point in time where they are used is decisive for the effi- 
ciency of the search. When choosing parameters the 
compromise between the sensitivity and selectivity of the 
search has to be considered. Moreover, the calculation of 
an identification score has to take into account the nature 
of parameters and the level at which they intervene. A 
preliminary study showed the importance of these param- 
eters (see Section 3.2.1). We therefore developed a new 
identification tool based on the role and the relative impor- 
tance of the various parameters, in order to determine a 
score allowing the best possible discrimination between 
the searched protein and the other candidate proteins. 
The algorithm limits the parameters with exploratory 
effects, while preserving enough sensitivity to be able to 
find most of the proteins. In that way, by strongly limiting 
the number of possible combinations arising from modifi- 
cation and missed cleavage parameters, one can obtain a 
fast and highly discriminant search algorithm, which does 
not produce too many candidates. The speed of the algorithm
is also essential when it comes to automation of the
process for large-scale identification projects. Only one
missed cleavage is allowed, as well as the following
modifications: cysteine carboxymethylation, acrylamide
adducts to cysteines and oxidized methionines. For these
modifications, we permit only 0, 1 or all corresponding
amino acids to be modified, in order to avoid too many
combinations. Thus, the database can be parsed linearly
and digested on-line, which avoids the use of a voluminous
mass index. To improve efficiency, the databases
(SWISS-PROT and TrEMBL in FASTA format) have been
split up into about 40 different sections, each of which
contains the sequences of a specific species or taxonomic
category. A species tree was built that allows parsing only
the part of the database corresponding to the user-specified
organism or range of organisms. Finally, we consider
the whole set of parameters with filtering effects, in the
hope of modulating their usage and thus avoiding the
effects of fixed thresholds, which eliminate interesting
candidate proteins too radically.
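The restrictions above (at most one missed cleavage, and 0, 1 or all residues modified) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, and the cleavage rule (cut after K or R unless followed by P) is an assumed trypsin-like rule, since the enzyme is not specified in this section.

```python
import re


def digest(sequence, max_missed=1):
    """Return peptides of an in silico digest with up to max_missed missed cleavages.

    Assumption: trypsin-like rule, cleave after K or R unless the next residue is P.
    """
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", sequence) if f]
    peptides = []
    for i in range(len(fragments)):
        for missed in range(max_missed + 1):
            if i + missed < len(fragments):
                # Joining consecutive fragments models `missed` missed cleavages.
                peptides.append("".join(fragments[i:i + missed + 1]))
    return peptides


def allowed_mod_counts(n_sites):
    """Numbers of modified residues explored for one modification type: 0, 1 or all."""
    return sorted({0, min(1, n_sites), n_sites})


def variant_masses(base_mass, n_sites, delta):
    """Peptide masses for the allowed modification counts (delta = mass shift per residue)."""
    return [base_mass + k * delta for k in allowed_mod_counts(n_sites)]
```

With three modifiable residues only three mass variants are generated instead of four, and the saving grows with the number of sites, which is what keeps the on-line digestion cheap.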

2.4.3.2 Definition of the score 

The main difficulties in the definition of a score calculation 
are to determine the most important parameters, their rel- 
ative weights and how to integrate the whole set of 
parameters into the score calculation. For this reason, we 
use the parameter levels defined above to determine a 
score calculation using their respective properties. 
Parameters of level 1, the mass level, serve to calculate a
score of level 1, associated with each matching peptide. 
For a given protein, the contribution of the parameters of 
level 1 is the sum of the level 1 scores of its peptides. It 
can be seen as an extension of the notion of number of 
matches used by most of the existing identification tools 
that count the number of experimental masses matching 
theoretical peptide masses of the candidate proteins. The 
more identified masses a protein has in the mass spec- 
trum, the higher is the confidence in its identification. 
While tools such as PeptIdent and PeptideSearch assign
a weight of either 0 or 1 to each peptide mass, depending
on whether or not it is a match, our idea is that the weight
associated with a peptide mass can be modified according
to parameters of level 1. This gives an indication of
the importance of a mass in the score calculation. We use
four parameters at this level: the number of chemical
modifications, the number of missed cleavages, the intensity
of the corresponding peak in the mass spectrum, and
the hydrophobicity coefficient. We then calculate the first
part of our score ($S_1$) by:

$$S_1 = \sum_{a=1}^{N} \mathrm{score}_1(a) \qquad (4)$$

where

$$\mathrm{score}_1(a) = coef_m^{\,n_m(a)} \cdot coef_c^{\,n_c(a)} \cdot coef_i(a) \cdot coef_h(a)$$

and $N$ is the number of matched peptides, $\mathrm{score}_1(a)$
the score of level 1 associated with peptide $a$, $coef_m$ the
modification coefficient, $n_m(a)$ the number of modifications
in peptide $a$, $coef_c$ the missed cleavage coefficient,
$n_c(a)$ the number of missed cleavages in peptide $a$,
$coef_i(a)$ the peak intensity coefficient of peptide $a$, and
$coef_h(a)$ the hydrophobicity coefficient of peptide $a$. In this
expression, the modification and missed cleavage coefficients
are fixed for all peptides and all proteins. However,
their effect grows with the number of modifications and
missed cleavages present in the peptide, since these
numbers appear as exponents. $coef_h(a)$ decreases with the
hydrophobicity of peptide $a$ (the weaker the hydrophobicity, the
higher $coef_h(a)$), while $coef_i(a)$ is proportional to the peak
intensity of peptide $a$ (the higher the peak intensity, the
higher $coef_i(a)$).
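As an illustration, the level 1 score can be written as a short Python sketch. The dictionary keys and the default coefficient values (0.5 and 0.1) are hypothetical placeholders chosen for the example; in the paper these coefficients are obtained by the learning algorithm.

```python
def level1_score(n_mod, n_missed, coef_i, coef_h, coef_m=0.5, coef_c=0.1):
    """Weight of one matched peptide.

    coef_m and coef_c are raised to the number of modifications and of
    missed cleavages; the default values are illustrative only.
    """
    return (coef_m ** n_mod) * (coef_c ** n_missed) * coef_i * coef_h


def s1(peptides, coef_m=0.5, coef_c=0.1):
    """Sum of the level 1 scores over all matched peptides of a protein."""
    return sum(
        level1_score(p["n_mod"], p["n_missed"], p["coef_i"], p["coef_h"], coef_m, coef_c)
        for p in peptides
    )
```

With all coefficients equal to 1 the sum reduces to the plain number of matches, which is the 0-or-1 weighting used by tools that only count matched peptides.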

Parameters of level 2 are used to compute coefficients 
that are then applied to the previously defined score. 
Indeed, at level 2, the parameters concern the whole pro- 
tein, so they have to directly modify the value of the score 
associated with the protein. Four parameters are used at
level 2: the molecular weight of the protein ($M_r$), its
isoelectric point (pI), a coverage coefficient (the percentage
of the protein sequence covered by the matched peptides)
and the standard deviation of the distances between
experimental and theoretical masses. The score of level 2
($S_2$) is calculated as:

$$S_2 = \frac{1}{coef_\sigma} \cdot coef_w \cdot coef_p \cdot coef_r \qquad (5)$$

where $coef_\sigma$ is the standard deviation coefficient, $coef_w$
the molecular weight coefficient, $coef_p$ the isoelectric point
coefficient and $coef_r$ the coverage coefficient. The criterion
for considering the mass distance between experimental
and theoretical masses for all matched masses is
based on the fact that the more constant this distance is
for all matched masses, the lower the likelihood that
the matches happened randomly. This notion was refined
to take into account the calibration error of the measuring
device. Thus, we perform a robust and iterative linear
regression [27] upon all matched masses, and eliminate
the masses that are too far from the regression line
(which are more likely to be false matches). We then calculate
the standard deviation of matched masses around
this line. This regression is iterative as it is performed in
several steps, each step eliminating the masses farthest
from the regression line, the line then being recalculated
based on the new set of masses. The iteration is
stopped when no mass has been eliminated in the previous
step, or when a given minimum number of masses is
reached. The standard deviation calculated at this last
step gives a hint of the correspondence between the
mass alignment and the supposed spectrometer error.
Moreover, one can expect that the linear regression compensates
for some calibration errors occurring during the
peak detection, thus stabilizing the overall algorithm.
The $M_r$ and pI coefficients are nonlinear. We define several
thresholds for the distance between experimental and
theoretical values of $M_r$ and pI, and then associate a
coefficient with each of these thresholds. The more the
theoretical values move away from the experimental values, the
weaker the coefficient is. The coverage coefficient is
proportional to the percentage of the sequence that is covered;
therefore, the higher the percentage, the higher the
coefficient.
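The iterative regression on the mass errors can be sketched as follows. This is a simplified illustration, not the robust procedure of [27]: it uses ordinary least squares, removes one point per pass, and the names `max_resid` and `min_points` are hypothetical stand-ins for the learned elimination and stopping thresholds.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line through the points; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def iterative_regression(theoretical, experimental, max_resid, min_points=3):
    """Regress the mass error on the theoretical mass, dropping the farthest
    point while its residual exceeds max_resid and more than min_points remain.

    Returns the kept (theoretical mass, error) pairs and the standard
    deviation of their residuals around the final line.
    """
    points = [(t, e - t) for t, e in zip(theoretical, experimental)]
    while True:
        slope, intercept = fit_line([p[0] for p in points], [p[1] for p in points])
        resid = [abs(y - (slope * x + intercept)) for x, y in points]
        worst = max(range(len(points)), key=resid.__getitem__)
        if resid[worst] <= max_resid or len(points) <= min_points:
            break
        del points[worst]  # discard the likely false match, then refit
    sigma = (sum(r * r for r in resid) / len(resid)) ** 0.5
    return points, sigma
```

A constant or linearly drifting error across all matches leaves a small sigma, whereas random matches scatter around the line and inflate it.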

Finally, the total score associated with a protein is given by
the expression:

$$\mathrm{score} = (S_1)^{\alpha} \cdot S_2 \qquad (6)$$

where $\alpha$ is a weight expressing the importance of the
parameters of level 1 against those of level 2. Parameters of level
3 have not yet been taken into account, but they will be
used within the scope of the "molecular scanner" [25].
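A minimal Python sketch of the level 2 coefficients and of the total score follows. It assumes that Eq. (5) reads S2 = coef_w * coef_p * coef_r / coef_sigma and Eq. (6) reads score = S1^alpha * S2; the step function, its threshold values and the fallback of 0.0 beyond the last threshold are hypothetical illustrations of the "one coefficient per threshold" scheme, not values from the paper.

```python
def step_coef(distance, thresholds, coefs):
    """Nonlinear coefficient: the coefficient of the first threshold
    that the distance does not exceed."""
    for limit, coef in zip(thresholds, coefs):
        if distance <= limit:
            return coef
    return 0.0  # assumption: no credit beyond the last threshold


def level2_score(coef_sigma, coef_w, coef_p, coef_r):
    """Assumed reading of Eq. (5): protein-level coefficients divided by the
    standard deviation coefficient."""
    return coef_w * coef_p * coef_r / coef_sigma


def total_score(s1, s2, alpha):
    """Assumed reading of Eq. (6): level 1 score weighted by the power alpha,
    multiplied by the level 2 score."""
    return (s1 ** alpha) * s2
```

Raising S1 to the power alpha lets a single learned number shift the balance between peptide-level and protein-level evidence without rescaling every coefficient.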



2.4.3.3 The algorithm 

The algorithm (1) used for the identification can be sum- 
marized as follows in a pseudoprogramming language. 
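The following Python sketch shows the overall search loop implied by Sections 2.4.3.1 and 2.4.3.2. It is a hedged illustration rather than the authors' listing: the function name, the data layout (a dictionary of precomputed theoretical peptide masses per protein) and the `level1` and `level2` callables are assumptions, and the defaults for `delta_mass` and `min_match` are simply the settings quoted in Section 3.3.

```python
def identify(peaks, proteins, delta_mass=0.3, min_match=3, alpha=1.0,
             level1=lambda mass: 1.0, level2=lambda matches: 1.0):
    """Rank candidate proteins for one spectrum.

    peaks    -- experimental peptide masses from the peak detection
    proteins -- dict: protein name -> theoretical peptide masses (hypothetical layout)
    level1   -- weight of one matched mass (placeholder for the level 1 score)
    level2   -- protein-level factor (placeholder for the level 2 score)
    """
    ranked = []
    for name, peptide_masses in proteins.items():
        # Mass level: keep theoretical masses found in the spectrum within tolerance.
        matches = [m for m in peptide_masses
                   if any(abs(m - p) <= delta_mass for p in peaks)]
        if len(matches) < min_match:
            continue  # too few matches to be a candidate
        s1 = sum(level1(m) for m in matches)
        s2 = level2(matches)
        ranked.append(((s1 ** alpha) * s2, name))
    ranked.sort(reverse=True)  # best score first
    return ranked
```

With the default placeholders the score reduces to the number of matched masses; plugging in real level 1 and level 2 functions changes the ranking but not the loop.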



2.5 The learning 

The score calculation and the peak detection that we use 
involve many coefficients (some also requiring several 
thresholds) that are associated with the various parame- 
ters. These coefficients determine the relative importance 
of each parameter in the score calculation, in order to be 
able to best discriminate the right protein from the other 
candidate proteins. We use a learning algorithm to deter- 
mine the coefficients and threshold values that allow the 
best discrimination. For this reason, the peak detection 
and identification parts of the algorithm have been unified 
to adjust all the parameters involved in the whole process, 
from spectrum analysis to identification. A genetic algo- 
rithm [28] has been applied to a training set of already 
identified proteins. This algorithm searches for the best 
coefficient values that allow the identification algorithm to 
identify the right protein, with its score being as distinct as 
possible from the scores of the following proteins in the 
ranked list of candidate proteins.

2.5.1 The genetic algorithm

For the learning phase, 36 variables have been defined, 
representing all the parameters and thresholds needed to 
calculate the score. Among these 36 parameters, 33 have 
real values and only 3 have integer values. Therefore the 
parameters are coded as a vector (chromosome) of 36
real values (genes). We use a nonlinear mutation opera- 
tor [29] for genetic algorithms working with real values. 
This operator decreases the mutation effect during gener- 
ations, favoring the convergence of values associated 
with genes. We use a classical crossover operator and a 
readjusting fitness function operator [28], thus avoiding 
an all too rapid convergence of the algorithm. We have 
developed an extension to the classical genetic algorithm 
that uses two populations with different convergence lev- 
els, which optimizes the quality of the results. The popula- 
tion with a high convergence level contains 26 chromo- 
somes and the one with a weak convergence level 
contains 44 chromosomes, each of them representing a 
set of particular parameters. We define a fitness function 
whose value characterizes how well our scoring function 
can discriminate the right protein. For the parameter val- 
ues of each chromosome, we apply peak detection and 
identification algorithms to a subset of spectra from the 
training set. The results of these algorithms are then used 
to calculate the fitness value associated with each chro- 
mosome as follows: 



$$\mathrm{value} =
\begin{cases}
\ldots & \text{if } Rprot = prot_1,\\
0.5 - position(Rprot) \times 0.05 & \text{otherwise}
\end{cases}
\qquad (7)$$

where $score_i$ is the score of the $i$th protein from the list of
results, $Rprot$ the name of the searched protein, $prot_i$ the
name of the $i$th protein from the list of results, and
$position(Rprot)$ the position of the right protein in the list. The
total fitness of the chromosome is the average of these
values for the subset of spectra.
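The nonlinear mutation operator mentioned in Section 2.5.1 is only cited [29], not spelled out. One widely used operator with the described behaviour (a mutation effect that decreases over the generations) is Michalewicz's non-uniform mutation; the Python sketch below implements that operator under the assumption, not confirmed by the text, that it is the one meant. The shape parameter `b` and the gene bounds are hypothetical.

```python
import random


def nonuniform_mutation(value, low, high, generation, max_generation, b=2.0, rng=random):
    """Mutate one real-valued gene within [low, high].

    The maximal step shrinks as `generation` approaches `max_generation`,
    so late generations only fine-tune the value.
    """
    def delta(span):
        r = rng.random()
        # r ** exponent tends to 1 as the exponent tends to 0, so delta tends to 0.
        return span * (1.0 - r ** ((1.0 - generation / max_generation) ** b))

    if rng.random() < 0.5:
        return value + delta(high - value)  # move towards the upper bound
    return value - delta(value - low)       # move towards the lower bound
```

Because each step is bounded by the distance to the nearest limit, mutated genes never leave their allowed interval, which matters for threshold parameters such as tolerances.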



3 Results and discussion 

3.1 Peak detection 

Figure 5 shows a region of the SYP_ECOLI spectrum
with peaks and their fits. Only a small fraction of the
peaks could be interpreted as peptides of SYP_ECOLI;
the other peaks may be due to impurities, protein fragments,
or modifications. This indicates that small peaks may
be important for identifying a protein. It is not clear a priori
whether the simple classifier given by Eq. (3) is sufficient
to separate 'false' peaks from 'true' ones. Therefore we
performed a peak detection for the ten spectra used for
testing the identification algorithm (Section 3.3) with
threshold values of $e_{max} = 1$ and $h_{min} = 1$, which were not
too restrictive. We plotted lg(e) versus lg(h/n) (Fig. 6). This
shows that there is a strong overlap between the 'true'
and the 'false' peaks and that the classifier (3) is not able to
separate all of them. However, the values of $e_{max}$ and $h_{min}$ given
by the learning algorithm (Section 3.2) indicate that it is
more important for the identification to consider all 'true'
peaks, even if some 'false' ones mix in. Another result is
the strong correlation between e and h/n, i.e., the higher the
peaks the better the fit. This is mainly due to the fact that
we normalized the error function (2) with respect to the
height.

3.2 The learning 

3.2.1 Preliminary study 

The influence of the main parameters upon
the quality and speed of identification was studied. Studies
have been performed by others, but without the use of
experimentation [30]. For our study, the PeptIdent tool
was employed to identify a set of 20 known proteins, each
time varying the values of the available parameters. A first
result showed the dominant importance of the filtering
parameters, especially the choice of a specific species
and, to a lesser degree, the information about molecular
weight and isoelectric point. Without these parameters,
the correct protein was often lost among a very large set
of candidate proteins. The analysis also highlighted the
strong effect of the modification and missed cleavage
parameters upon the number of generated candidates.
Indeed, PeptIdent takes into account annotations from
SWISS-PROT entries and chemical modifications, which
together produce a very large number of different
masses that are possible for a single peptide.
Therefore, the quality of the results often deteriorates
when one allows the whole set of modifications or, even
worse, if one allows one or two missed cleavages. The
analysis showed that certain proteins (when very few peptides
from the protein are found in the spectrum and when
they are modified or incompletely digested) cannot be
found without the use of at least one of these parameters.

Figure 5. Detected peaks in the SYP_ECOLI
spectrum (dots) and their fits (solid line). The
values of the error function e(m) were:
e(1179.561) = $0.37 \times 10^{-3}$, e(1185.574) = $0.45 \times 10^{-3}$,
e(1193.584) = 2.63 X 10"°,
e(1201.588) = $1.68 \times 10^{-3}$ and e(1207.533) =
2.53 x 10*^. Only the peaks at m = 1185.574
and m = 1207.533 match a peptide of SYP_ECOLI
considering only chemical modifications
in PeptIdent and one missed cleavage.

Figure 6. lg(e) versus lg(h/n) for the peaks of
the ten test spectra with $e_{max} = 1$ and $h_{min} = 1$.
(a) lg(e) versus lg(h/n) for the peaks that were
considered as 'false' or at least uncertain after
a visual examination of the spectra, (b) lg(e)
versus lg(h/n) for the peaks that were considered
as 'true', (c) lg(e) versus lg(h/n) for the
peaks that matched a peptide mass of the
corresponding protein. The solid lines indicate the
values of $e_{max} = 0.5$ and $h_{min} = 2.2$ obtained by
the learning algorithm.


3.2.2 Genetic algorithm 

We selected a set of 91 proteins with known identification
(identified with at least two methods, including peptide
mass fingerprinting, microsequencing, gel matching and
amino acid composition analysis) as a training set. We
carried out several learning phases, gradually increasing
the number of parameters and the number of proteins in the
training set, and varying certain parameters of the genetic
algorithm that influence its convergence level. Each
application of the peak detection and identification algorithm
takes about 1 min (on an Ultra Sparc Station 5, Sun
Microsystems Inc.), so it is not possible to test the whole
set of 91 spectra for each chromosome. Instead, we
randomly chose 20 spectra for each chromosome and
defined the fitness to be the average of the corresponding
values. We can estimate the execution time of our learning
algorithm for 100 generations: 100 generations × 70
chromosomes × 20 spectra × 1 min = 140 000 min (about
100 days). Due to the time needed for a complete execution
of the algorithm, we currently only have partial
results. The version of our learning algorithm that uses
the whole set of 36 parameters presented in this article
and the 91 spectra of the data set was still running when
the article was submitted. However, we present here the
results obtained by the previous execution, using 29
parameters and 58 spectra during 24 generations.

The parameters of this execution are: the two parameters
of peak detection, $e_{max}$ and $h_{min}$; the parameter calibFact,
which influences the calibration weight in the peak detection;
minMatch, which gives the minimum number of
matches necessary to consider a protein as a candidate;
deltaMass, the maximum allowed tolerance of masses;
the two coefficients coefReg1 and coefReg2, which are
the thresholds for eliminating masses with linear regression;
the coefficient coefMiss, applied to a missed cleavage;
the coefficient coefModif, applied to a modification;
the four coefficients MWcoef1 to MWcoef4, applied to the
deviation of molecular masses and associated with the four
threshold parameters MWthres1 to MWthres4 for the
deviation of molecular masses; the four coefficients
PIcoef1 to PIcoef4, applied to the deviation of isoelectric
points and associated with the five threshold parameters
PIthres1 to PIthres5 for the deviation of isoelectric points;
the two parameters nbMatchThres1 and nbMatchThres2,
which determine the threshold at which the iteration on
the linear regression is stopped; and power, the weight
applied to the parameters of level 1 against those of level 2.

Figure 7 shows the results obtained with these parameters.
The first one, fitness, shows the algorithm convergence
with three curves. The lowest one corresponds to
the average fitness of the population with a weak convergence
level, and the following one to the one with a high
convergence level. We can see very good convergence
of the population with a high convergence level, with a
maximum average fitness of 0.8123 for 26 chromosomes,
corresponding to 520 identifications. The uppermost
curve gives the value of the best chromosome for each
generation, with a maximum at 0.9315 (20 identifications).
These chromosomes are used to determine the best
parameters for the identification algorithm at the end of
the learning step. Therefore we present the evolution of
the parameters of the best chromosomes in the following
graphs. Note, however, that these results depend on the
data used for the learning. The lack of variety in the data
for a given parameter can cause a bias in the obtained
results. In the future it will be important to repeat the
learning with a larger set of data and as much diversity as
possible.

The parameters that present the clearest results are $h_{min}$,
minMatch and coefMiss. The high convergence of $h_{min}$
shows that this parameter is important for the identification.
Its value implies the importance of considering all
true peaks even if some fictitious ones mix in. Because
of the strong correlation between $h_{min}$ and $e_{max}$, peaks
are primarily selected by $h_{min}$ and $e_{max}$ does not
intervene. The results of minMatch clearly show that the
exploratory effects of this parameter are more important
than its filtering effects, thus avoiding the loss of small
proteins among the candidate proteins. The very weak
values of the coefficient associated with the missed cleavage
indicate that the combinations due to the use of missed
cleavages imply such a huge increase of false matches
that the weight associated with peptides with missed
cleavages must be drastically reduced. This corresponds to
cases with very good digestion. We can also deduce that,
for an algorithm that does not incorporate a penalizing
factor for missed cleavages, it is preferable in the case of good
digestion not to allow for missed
cleavages at all.

Some other results are also rather clear, such as the high
value of deltaMass, which gives an important exploratory
effect at level 1 in the matching of peptides that can be
compensated for by the filtering effect of the linear
regression, used only at level 2. The high values of parameter
coefReg1 show that it is preferable to eliminate masses
only when they are far enough from the line of the linear
regression (level 1). In any case their weight is lowered by
the value of the standard deviation (level 2). The values
of parameter power suggest that the weight to give to
the parameters of level 1, compared to those of level 2,
must be higher than what was allowed in this experiment.
A larger variation interval has therefore been permitted
for this parameter in the new experiment currently
under way. The high variation of values of the calibFact
parameter confirms the limited role of the calibration, due
to the use of linear regression.

Due to the small number of generations calculated, the
variability of the other parameters cannot be clearly
explained yet. They may not have converged at the time
of writing, but one can probably say that they are less
discriminant than those presented before. One more global
conclusion is that the division of the score calculation into
levels, which allows parameters to be considered at various steps
of the search process, is very important to resolve conflicts
between exploratory and filtering effects. The exploratory
effects can be used efficiently (if they are not too costly,
as is the case for missed cleavages), if strong
enough filtering effects are present later to compensate for
them. Thus, we obtain a search algorithm that
studies a maximum of candidate proteins, while preserving
sufficient discrimination power to bring out the right
protein.

Figure 7. Learning of parameters. The x axes correspond to the
number of generations and the y axes correspond to the values
of the parameters.

3.3 Comparison

We have developed an identification tool, PeptIdent2,
based on the method presented in this paper. In order to
validate the method, we have undertaken a comparative
study of the quality of protein identification obtained by
several identification tools. We compare the results of
PeptIdent2 with those of PeptIdent, Mowse, ProFound,
PeptideSearch and MS-Fit (see Section 1.2). For this, we
took a set of ten mass spectra of proteins whose identifications
have been confirmed by microsequencing. For
each protein, we show in Tables 1 and 2 the identification
result for each of the algorithms. For each identification

Table 1. Comparison of identification tools ("raw" results, without user interpretation)

Protein   PeptIdent2
P56480    1(335.5) | 2(9.38)
P02088    1(205.2) | 2(23.9)
P01942    1(9.85) | 2(2.76)
P17742    1(9.81) | 2(0.29)
P43024    1(16.21) | 2(12.72)
P10639    1(2.31) | 2(1.72)
P12787    1(41.54) | 2(10.61)
P27773    1(209.69) | 2(11.82)
P27773    1(557.6) | 2(121.2)
P38647    1(737.0) | 2(84.62)

PeptIdent

Mowse

ProFound



1(15)12(7) 
1(22)19(94*) 

1(5)15(4^) 

1 (18)11 8(63ex) 



1(12 2e *)l2(11) 

1(27)117(154^ 

1(19)I5(13 2BX ) 



1(3.59" 10 )I8(1.42 7 ) 1(2.9 -'JlStS-r 4 ). 
1(1.5* 6 )!2(3.99^ 5 ) 1(5^T , li JI3(8.1 4 ) 
1(3.4 • , )I3(1.7 ') 



1(3.67 +5 )I8 



1(3.4 • 1 )!1 4(3.3 *) 



(2.88 4 ) 1(4.9 1 2ex )l3(4.3 ^ 
1(3.2 "-)\^\Jo A ) 
1(1.0)!2(1.9' 5 ) 
1(1.18* ,2 )l4(1.58- n ) 1(2.r , 26X )i3(1.9 ') 



PeptideSearch MS-Fit 



1(13)13(12) 
1(7 2e x)l3(6) 

K4cex)i5(3) 

1(5eex)l9(4) 

1(8)!3(7 2ex ) 
1(12)12(11) 



1(2.39 +5 )I2(126) 
1(4.83 +6 )l2(t'.62- 3 ) 
.1(65.3)13(49.8) 
1(189)12(19.9) 



1(7.15* 4 )I2(394) 

1(5.81 +4 )I2(2.96^) 

1(1.99 +6 )I2(8.23' 3 ) 



Table 2. Comparison of identification tools after user analysis

Protein   PeptIdent2
P56480    1(335.5) | 2(9.38)
P02088    1(205.2) | 2(23.9)
P01942    1(9.85) | 2(1.11)
P17742    1(9.81) | 2(0.19)
P43024    1(16.21) | 2(12.72)
P10639    1(2.31) | 2(1.72)
P12787    1(41.54) | 2(10.61)
P27773    1(209.69) | 2(11.0^{2ex})
P27773    1(557.6) | 2(121.2)
P38647    1(737.0) | 2(23.02)

PeptIdent

Mowse



ProFound



1(15)12(6) 

1(10)12(5) 

1(5)12(44^ 

1(4)12(3) 

1(6)12(4) 

1(3 7 ex) 

1(8)12(7) 

1(12)I2(9 2W ) 

1(14 ex )l2(13 2ex ) 

1(15)l2(10 5ex ) 



PeptideSearch MS-Fit 



1(1.42- 7 )I2(9.2 +4 ) 1(8.r 4 )i2(1.r 21 ) 
1(1.5* 6 )!2(1.8r 5 ) l(5.0"' 2ex )l3(8.1 4 ) 

1(4.45 +4 )I5(2.88 +4 ) 1(1 .V 1 )i12(3.3 4 ) 

1(4-9' 1 2ex )l3(4.3 3 ) 
1(2.3"')I2(1.6 _1 ) 
1(1.0)12(1.9 5 ) ' 
1(1 -58-") 1(2.1 1 2ex )!3(9.9 2 ) 



1(12)12(5) 
1(7 2ex )!3(5) 

1(4^)15(3) 
1(5 6e x)l9(4) 

1(7)12(5) 
1(12)12(11) 



1(2.39 A5 )f2(126) 
1(4.83* 6 )!2(1.62~ 3 ) 
1(65.3)13(49.8) 
1(189)12(19.9) 



1(7.1 5^ 4 )I2(394) 

1(5.8r 4 )J2(2.96- A ) 

1(1.99 +6 )I2(8.23~ 3 ) 



we also give the score value (in parentheses) of
the first candidate protein followed by either the score
value of the second candidate protein (if the first one is
the right protein) or, otherwise, the rank and, in parentheses,
the score value of the right protein. The right protein
is always displayed in bold type. The notation Xex means
that the score of the corresponding protein is equal to that
of X other proteins, the algorithm not being able to give a
clear discrimination. Finally, we note "-" if the right protein
was not found among the first twenty candidate proteins.

For these experiments, the parameters used in all identification
programs were identical, provided these parameters were
available in the respective tool. The selected species
was mouse, the allowed $M_r$ variability was ±50%, the
allowed pI variability was ±1, the minimum number of
matched masses was 3, the maximal tolerance for
masses was 0.3 Da, at most one missed cleavage was
allowed and the modifications taken into account were
cysteine carboxymethylation and oxidized methionines.
Table 1 gives "raw" results, that is, without user
interpretation. In this table the databases used were SWISS-PROT
and TrEMBL for PeptIdent and PeptIdent2,
SWISS-PROT for MS-Fit, OWL for Mowse, nrdb for
PeptideSearch and NCBInr for ProFound. Table 2 gives the
results after a first analysis by an expert user in our
laboratory, in particular to remove the proteins from species
that did not correspond to the search, as Mowse,
ProFound and PeptideSearch do not narrow down the search
based on species. The TrEMBL database was also removed
for the PeptIdent and PeptIdent2 tools to allow a better
comparison with MS-Fit, which cannot use TrEMBL.

The first thing we notice is the good identification obtained
by PeptIdent2 in both tables. In the second table, the right
protein was identified in first place in 9 out of 10
cases, and with a large score discrimination (at least
fivefold) in 6 out of 10 cases. The only protein that was not
correctly identified was P10639, which ranked second in
the list of results, with a score quite close to that of the
first protein. No other identification program correctly
identified this protein, except for PeptIdent, which put it in
first place with six other proteins of identical score.
PeptIdent globally allowed good identification when the
TrEMBL database was not used, but with a much less
clear discrimination than PeptIdent2, and many proteins
were attributed identical scores. The other programs
obtained variable results, the best being ProFound and
MS-Fit. Note that programs that do not select a species
(Mowse, ProFound and PeptideSearch) give result lists
that are much larger, thus requiring a much longer manual
analysis time to select the right protein in the list. Moreover,
in this case, the risk is higher that the right protein
does not appear at all on the list of results. One can also
note that programs that use only the number of matched
peptides (PeptIdent and PeptideSearch) as their score
have a much weaker discrimination power than the others
and more often find proteins with identical scores, making
the interpretation of the results by the user more difficult.

We are now following this comparative study with a sec- 
ond one on a larger set of proteins and with various spe- 
cies in order to obtain better validation of the comparison. 
To preserve a maximum reliability in the comparison 
results, we plan to use only experiments in which MS 
identification has been at least confirmed by microse- 
quencing. 

4 Concluding remarks 

Protein identification and characterization is one of the 
most essential tasks performed in proteome research. 
The currently most widely used identification method 
compares the masses obtained from an MS spectrum of 
an enzymatically digested protein with the theoretical 
masses of proteins contained in an in silico digested pro- 
tein sequence database. The precise determination of the 
peptide masses in the spectra, and a highly discriminating 
mass comparison algorithm are therefore the keys to the 
accurate identification of proteins. We have developed a 
new tool to identify proteins from their peptide mass fin- 
gerprints. It comprises a fast and precise peak detection 
algorithm, as well as a new mass comparison and identifi- 
cation program, which is based on an advanced scoring 
method, both procedures being validated by an automatic 
learning algorithm. The analysis of the thresholds associated
with the peak detection has revealed that it is preferable
not to be too selective in the choice of peaks in the
mass spectrum, in order to avoid the loss of apparently
fictitious peaks that might eventually prove to be useful,
provided the identification algorithm is able to discriminate
'false' peaks from real ones. Our identification algorithm
has proven to be robust enough in this respect. Also, the
learning procedure has confirmed the advantage of a
scoring scheme based on the balance between exploratory
and filtering effects [...]
gain in the discrimination of the correct protein, in comparison
to other identification algorithms.

This work is now being extended by the development of a 
new version of the learning algorithm that will be able to 
classify the proteins in the learning set simultaneously
with the calculation of the parameter weights. This will 
determine several subsets of the parameter space, thus 
allowing an optimal discrimination of the scores. The goal 
is to determine several sets of parameters that will opti- 
mally discriminate the scores, no longer for all proteins, 
but rather for one subset of proteins that corresponds to a 
specific value of one of the experimental parameters 
(species, $M_r$, pI, etc.). Our score calculation will also be
extended at the contextual level within the frame of the 
development of our molecular scanner. In addition, a new 
intermediate level, the "correlation level", will be intro- 
duced between the protein and the contextual level, which 
will consider information from several experiments carried 
out with different experimental conditions producing sev- 
eral fingerprints of the same sample. The correlation of 
these data will then validate the information obtained from 
the preceding levels. We thus expect to further improve 
the efficiency of our protein identification method. 

This work was supported by the Swiss National Fund for
Scientific Research (grant 31-52974.97) and the Helmut
Morten Foundation. The authors would like to thank Dr.
Eva Jung for useful discussions and Luisa Tonella, Gerald
Rosselat, Salvo Paesano and Abderrahim Karmime
for preparing the samples and testing the software.